First, let’s find out who the customers are.
round(prop.table(table(nyc_bike_trips$usertype,
nyc_bike_trips$gender))*100, 2)
##
## female male unknown
## casual 5.54 8.16 9.62
## member 22.93 52.33 1.42
round(prop.table(table(nyc_bike_trips$age_grp,
nyc_bike_trips$usertype))*100, 2)
##
## casual member
## Under30 8.66 23.46
## 30-44 4.34 31.42
## 45-59 10.21 17.09
## 60plus 0.12 4.71
round(prop.table(table(nyc_bike_trips$age_grp,
nyc_bike_trips$gender))*100, 2)
##
## female male unknown
## Under30 11.50 20.25 0.37
## 30-44 10.67 24.67 0.41
## 45-59 5.00 12.07 10.23
## 60plus 1.30 3.51 0.02
From the above tables we see that about 60% of total trips are made by male riders, about 99% of the trips made by casual users are by people under 60 years old, about 93% of the trips made by customers with unknown gender are in the 45-59 age group (10.23 of 11.03 percentage points), and around 77% of total trips are made by member users.
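The within-group shares quoted above can be checked by rescaling each column of a joint table so that it sums to 100. The sketch below retypes the age group by gender percentages from the output above into a matrix (the object names `age_gender` and `within_gender` are illustrative, not from the original analysis) and uses `prop.table()` with `margin = 2`.

```r
# Joint percentages copied from the age group by gender table above
age_gender <- matrix(
  c(11.50, 20.25,  0.37,
    10.67, 24.67,  0.41,
     5.00, 12.07, 10.23,
     1.30,  3.51,  0.02),
  nrow = 4, byrow = TRUE,
  dimnames = list(age_grp = c("Under30", "30-44", "45-59", "60plus"),
                  gender  = c("female", "male", "unknown"))
)

# margin = 2 rescales every column (gender) so that it sums to 1
within_gender <- round(prop.table(age_gender, margin = 2) * 100, 1)
within_gender

# Share of unknown-gender trips that fall in the 45-59 age group
within_gender["45-59", "unknown"]
```

With the full data, the same kind of table should follow from calling `prop.table()` on `table(nyc_bike_trips$age_grp, nyc_bike_trips$gender)` with `margin = 2`.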
# Let's find total trip time by user type
ttime <- nyc_bike_trips %>% group_by(usertype) %>%
summarise(total_time = sum(tripduration))
ttime
## # A tibble: 2 x 2
## usertype total_time
## <chr> <dbl>
## 1 casual 7651549394
## 2 member 13917547774
We see above that casual users contribute 35.5% of the total trip duration, even though they account for only about 23% of total trips.
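The 35.5% figure follows directly from the two totals printed above. A minimal sketch of the arithmetic, with the totals retyped by hand (`total_time` and `time_share` are illustrative names):

```r
# Total trip duration per user type, copied from the tibble above
total_time <- c(casual = 7651549394, member = 13917547774)

# Each group's share of the combined trip duration, in percent
time_share <- round(total_time / sum(total_time) * 100, 1)
time_share
```

The casual share of duration (about 35.5%) can then be set against the casual share of trips (about 23%) from the first table.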
Before we start visualizing the data, we need to fix the order of weekday and month.
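To see why the order needs fixing, here is a small sketch with a handful of made-up day names (the vector `days` is hypothetical). Plain character values sort alphabetically, while an ordered factor sorts by the level order it is given.

```r
weekday_levels <- c("Monday", "Tuesday", "Wednesday", "Thursday",
                    "Friday", "Saturday", "Sunday")

# Hypothetical sample of day names, not taken from the trip data
days <- c("Sunday", "Wednesday", "Monday", "Friday")

# Character vectors sort alphabetically
alphabetical <- sort(days)

# Ordered factors sort by their level order instead
days_ordered <- ordered(days, levels = weekday_levels)
calendar_order <- as.character(sort(days_ordered))

alphabetical
calendar_order
```

Plot axes are laid out in factor level order, which is why the levels are set explicitly before plotting.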
nyc_bike_trips$weekday <- ordered(nyc_bike_trips$weekday,
levels=c( "Monday", "Tuesday", "Wednesday",
"Thursday", "Friday", "Saturday", "Sunday"))
nyc_bike_trips$Month <- ordered(nyc_bike_trips$Month,
levels=c( "Jan", "Feb", "Mar","Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))
For the next four plots I decided to use line charts instead of bar charts because we want to find the trends for each user type. The first two plots focus on the number of rides and the average ride duration by day of the week.
nyc_bike_trips %>%
group_by(usertype, weekday) %>%
summarise(rides = n()) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = rides,
group = usertype, color = usertype)) + geom_line() + geom_point() +
labs(title = "Total trips by usertype Vs. Days of the week") +
xlab("Days of the week") + ylab("Number of rides") +
scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.
We see in the above figure that member users make more trips than casual users on every day of the week. For casual users the number of rides is higher on Saturday and Sunday, while for member users it is lower on those days.
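The `summarise()` message printed after the plot code is informational: it reports that the result is still grouped by `usertype`. A minimal sketch of how it could be silenced with the `.groups` argument, using a tiny made-up data frame (`toy_trips` is hypothetical and stands in for `nyc_bike_trips`):

```r
library(dplyr)

# Hypothetical stand-in for nyc_bike_trips with only the grouping columns
toy_trips <- data.frame(
  usertype = c("casual", "casual", "member", "member", "member"),
  weekday  = c("Saturday", "Sunday", "Monday", "Monday", "Saturday")
)

# .groups = "drop" returns an ungrouped result and suppresses the message
rides_by_day <- toy_trips %>%
  group_by(usertype, weekday) %>%
  summarise(rides = n(), .groups = "drop")

rides_by_day
```

Dropping the groups does not change the counts; it only affects how later verbs treat the result.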
nyc_bike_trips %>%
group_by(usertype, weekday) %>%
summarise(average_duration = mean(tripduration)) %>%
arrange(usertype, weekday) %>%
ggplot(aes(x = weekday, y = average_duration,
group = usertype, color = usertype)) + geom_line() + geom_point()+
labs(title = "Days of the week Vs. Average trip duration") +
xlab("Days of the week") + ylab("Average ride duration in sec.") +
scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.
Above, we see that the average trip duration is higher for casual users than for member users on every day of the week, and the two lines follow a very similar pattern.
nyc_bike_trips %>%
group_by(usertype, Hour) %>%
summarise(rides = n()) %>%
arrange(usertype, Hour) %>%
ggplot(aes(x = Hour, y = rides,
group = usertype, color = usertype)) + geom_line()+ geom_point() +
labs(title = "Total trips by usertype through the day") +
xlab("Time of the day") + ylab("Number of rides") +
scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.
In the above plot we see that from midnight to 5am the total number of rides is very close between the two groups, but after 5am the gap widens quickly. For both groups the total number of trips peaks at 5pm.
nyc_bike_trips %>%
group_by(usertype, Hour) %>%
summarise(average_duration = mean(tripduration)) %>%
arrange(usertype, Hour) %>%
ggplot(aes(x = Hour, y = average_duration,
group = usertype, color = usertype)) + geom_line() + geom_point() +
labs(title = "Average trip duration by usertype through the day") +
xlab("Time of the day") + ylab("Average trip duration in sec.") +
scale_color_manual(values=c('blue','red'))
## `summarise()` has grouped output by 'usertype'. You can override using the `.groups` argument.
We see above that the average trip duration is higher for casual users than for member users at every hour of the day.
Now let’s focus on the whole year. To do that we will use calendar heat maps.
nyc_bike_trips %>%
group_by(weekday,week, usertype) %>%
summarise(rides = n())%>%
ggplot(aes(x = week, y = weekday, fill = rides)) +
viridis::scale_fill_viridis(name="NYC City Bike") +
geom_tile(color = 'white', size = 0.1 ) +
coord_fixed(ratio = 2) +
scale_fill_gradient(low="green", high="red") +
scale_x_continuous( expand = c(0, 0), breaks = seq(1, 52, length = 12),
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec")) + facet_grid(usertype ~.)+
labs(title = "Calendar heatmap of total trips")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
We see above that the total number of rides is higher for member users than for casual users on every day of the week and in every week of the year. For casual users the total number of trips is higher on Saturday and Sunday from March to November. On the other hand, for member users the total number of trips is higher on weekdays, except in March and April.
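When reading months off the heat map, it helps to know where the month labels actually sit. The plotting code places them with `seq(1, 52, length = 12)`, which spaces 12 breaks evenly across weeks 1 to 52 rather than at the true first week of each month. A small sketch of the resulting positions (`month_breaks` is an illustrative name):

```r
month_labels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Same break positions as in the plotting code, with length.out spelled out
month_breaks <- seq(1, 52, length.out = 12)

# Week position at which each month label is drawn
data.frame(month = month_labels, week = round(month_breaks, 1))
```

The breaks are about 4.6 weeks apart and the "Dec" label lands on week 52, so the labels should be read as approximate guides rather than exact month boundaries.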
nyc_bike_trips %>%
group_by(weekday,week, usertype) %>%
summarise(av_duration = mean(tripduration))%>%
ggplot(aes(x = week, y = weekday, fill = av_duration)) +
viridis::scale_fill_viridis(name="NYC City Bike") +
geom_tile(color = 'white', size = 0.1 ) +
coord_fixed(ratio = 2) +
scale_fill_gradient(low="green", high="red") +
scale_x_continuous( expand = c(0, 0), breaks = seq(1, 52, length = 12),
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))+facet_grid(usertype ~.)+
labs(title = "Calendar heatmap of average trip duration in seconds")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
We can see from the above heat map that the average trip duration is higher for casual users than for member users on every day of the week and in every week of the year. The average is a little higher for both groups on Saturday and Sunday, except from March to June.
nyc_bike_trips %>%
group_by(weekday,week, usertype) %>%
summarise(med_duration = median(tripduration))%>%
ggplot(aes(x = week, y = weekday, fill = med_duration)) +
viridis::scale_fill_viridis(name="NYC City Bike") +
geom_tile(color = 'white', size = 0.1 ) +
coord_fixed(ratio = 2) +
scale_fill_gradient(low="green", high="red") +
scale_x_continuous( expand = c(0, 0), breaks = seq(1, 52, length = 12),
labels = c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
"Jul", "Aug", "Sep", "Oct", "Nov", "Dec"))+facet_grid(usertype ~.)+
labs(title = "Calendar heatmap of median trip duration in seconds")
## `summarise()` has grouped output by 'weekday', 'week'. You can override using the `.groups` argument.
## Scale for 'fill' is already present. Adding another scale for 'fill', which
## will replace the existing scale.
We can see from the above heat map that the median trip duration is higher for casual users than for member users on every day of the week and in every week of the year. For both groups it is higher on Saturday and Sunday, except from March to June for casual users.
To really understand the data we need to see where the activity is happening.
First we need to create a Stamen map of New York City.
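One common reason to look at the median alongside the mean (an assumption about the motivation here, since the text does not state it) is that a few very long trips can pull the mean up sharply while leaving the median almost unchanged. A minimal sketch with made-up durations (`durations_sec` is hypothetical, not drawn from the trip data):

```r
# Hypothetical trip durations in seconds; the last one is a bike kept for a full day
durations_sec <- c(420, 540, 600, 660, 780, 86400)

mean_duration   <- mean(durations_sec)    # pulled up by the single long trip
median_duration <- median(durations_sec)  # stays near the typical trip

c(mean = mean_duration, median = median_duration)
```

When the mean and median heat maps show the same pattern, as they do here, the conclusion is less likely to be driven by a handful of outliers.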
nyc_bb <- c(left = min(nyc_bike_trips$start_lng),
bottom = min(nyc_bike_trips$start_lat),
right = max(nyc_bike_trips$start_lng),
top = max(nyc_bike_trips$start_lat))
nyc_st <- get_stamenmap(bbox = nyc_bb, zoom = 12, maptype = "terrain")
## Source : http://tile.stamen.com/terrain/12/1205/1537.png
## Source : http://tile.stamen.com/terrain/12/1206/1537.png
## Source : http://tile.stamen.com/terrain/12/1207/1537.png
## Source : http://tile.stamen.com/terrain/12/1205/1538.png
## Source : http://tile.stamen.com/terrain/12/1206/1538.png
## Source : http://tile.stamen.com/terrain/12/1207/1538.png
## Source : http://tile.stamen.com/terrain/12/1205/1539.png
## Source : http://tile.stamen.com/terrain/12/1206/1539.png
## Source : http://tile.stamen.com/terrain/12/1207/1539.png
## Source : http://tile.stamen.com/terrain/12/1205/1540.png
## Source : http://tile.stamen.com/terrain/12/1206/1540.png
## Source : http://tile.stamen.com/terrain/12/1207/1540.png
Let’s find the 100 most popular stations by user type
# Sorting stations by the total number of rides
stations <- nyc_bike_trips %>%
group_by(start_lng, start_lat, startstation_id, startstation_name,usertype) %>%
summarize(nTrips = n()) %>%
arrange(desc(nTrips))
## `summarise()` has grouped output by 'start_lng', 'start_lat', 'startstation_id', 'startstation_name'. You can override using the `.groups` argument.
# Getting the top 100 popular stations for member users
top_100member <- stations %>% filter(usertype == "member")
top_100member <- top_100member[1:100,]
# Getting the top 100 popular stations for casual users
top100Causual <- stations %>% filter(usertype == "casual")
top100Causual <- top100Causual[1:100,]
There are more than 1000 bike stations in New York City, so we will focus only on the top 100 for each user type.
Now let’s see the top 100 for member users
ggmap(nyc_st, darken = c(.3,"#FFFFFF")) +
geom_point(data = top_100member %>%
group_by(longitude = start_lng, latitude = start_lat) %>%
summarize(rides = sum(nTrips)), aes(x = longitude, y = latitude,
color = rides),size = 2, alpha = 1.5) +
scale_colour_gradient(low = "blue", high = "red") +
coord_cartesian(xlim = c(-74.021, -73.92), ylim = c(40.65, 40.81))+
xlab("Longitude") + ylab("Latitude") +
labs(title = "The top 100 stations for member users")
## `summarise()` has grouped output by 'longitude'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
For member users we see that most of the top 100 stations are north of Canal Street and south of 79th Street. Only two stations, both in Brooklyn, are outside Manhattan. The stations are clustered closely together, especially in the area north of Delancey Street, south of 42nd Street, west of First Ave. and east of 10th Ave.
The top 100 stations for casual users
ggmap(nyc_st, darken = c(.3,"#FFFFFF")) +
geom_point(data = top100Causual %>%
group_by(longitude = start_lng, latitude = start_lat) %>%
summarize(rides = sum(nTrips)), aes(x = longitude, y = latitude,
color = rides),size = 2, alpha = 1.5) +
scale_colour_gradient(low = "blue", high = "red") +
coord_cartesian(xlim = c(-74.021, -73.92), ylim = c(40.65, 40.81))+
xlab("Longitude") + ylab("Latitude") +
labs(title = "The top 100 stations for casual users")
## `summarise()` has grouped output by 'longitude'. You can override using the `.groups` argument.
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
The top 100 stations for casual users are more spread out than those for member users. We also notice that the stations around Central Park in Manhattan and Prospect Park in Brooklyn are included here. There are more stations outside Manhattan, and more around the bridges.
After looking at the most popular stations we need to see where people ride.
Let’s look at all routes for the year. Because there are thousands of different routes we cannot plot each of them, so we will use a density map to see the stations where most routes originate.
# Creating routes for all user type
routes <- nyc_bike_trips %>%
filter(startstation_id != endstation_id) %>%
group_by(start_lng, start_lat, end_lng, end_lat, usertype, gender) %>%
summarise(total = n(),.groups="drop")
#Then we create two sub tables for each user type
Casual <- routes %>% filter(usertype == "casual")
Member <- routes %>% filter(usertype == "member")
# Density plot routes for member users
ggmap(nyc_st) +
stat_density_2d(
data = Member,
aes(x = start_lng, y = start_lat, fill = ..level..), alpha = 0.5,
geom = "polygon") + scale_fill_gradient(low="pink", high="purple4") +
facet_grid(~ gender) +
xlab("Longitude") + ylab("Latitude") +
labs(title = "Density plot routes for member users by gender")
From above we see that, for all genders, the most popular routes originate south of 59th Street, east of 10th Ave. and West Street.
ggmap(nyc_st) +
stat_density_2d(
data = Casual,
aes(x = start_lng, y = start_lat, fill = ..level..), alpha = 0.5,
geom = "polygon") + scale_fill_gradient(low="pink", high="purple4") +
facet_grid(~ gender) +
xlab("Longitude") + ylab("Latitude") +
labs(title = "Density plot routes for casual users by gender")
In contrast to member users’ routes, the most popular routes for casual users are spread over almost all of Manhattan, Downtown Brooklyn and Williamsburg, Brooklyn, for all genders. To better understand the routes of each user type, let’s focus on the routes that were used more than 200 times, which is not the same as the top 200 routes.
# Routes used more than 200 times
routes200 <- routes %>% filter(total > 200)
#Then we create two sub tables for each user type
Casual200 <- routes200 %>% filter(usertype == "casual")
Member200 <- routes200 %>% filter(usertype == "member")
dim(Casual200)
## [1] 220 7
dim(Member200)
## [1] 9017 7
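The difference between "used more than 200 times" and "top 200" is that a threshold keeps however many routes clear the bar, while a top-N selection always keeps a fixed number. A small sketch with invented counts (`route_totals` is hypothetical, and N is 2 instead of 200 to keep the example short):

```r
# Hypothetical trip counts per route, invented for illustration
route_totals <- c(a = 950, b = 410, c = 205, d = 200, e = 120, f = 15)

# Threshold filter: every route with strictly more than 200 trips
above_threshold <- route_totals[route_totals > 200]

# Top-N selection: the N busiest routes, here N = 2
top_two <- head(sort(route_totals, decreasing = TRUE), 2)

names(above_threshold)
names(top_two)
```

This is why the two user types end up with very different row counts under the same threshold, as the `dim()` output above shows.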
Casual200 has only 220 observations, so we can plot the routes themselves. Member200 has 9017 observations, so for it we will use a density plot.
ggmap(nyc_st) +
stat_density_2d(
data = Member200,
aes(x = start_lng, y = start_lat, fill = ..level..), alpha = 0.5,
geom = "polygon") + scale_fill_gradient(low="pink", high="purple4") +
facet_grid(~ gender)+
xlab("Longitude") + ylab("Latitude") +
labs(title = "Density plot for routes used more than 200 times\n by member users")
## Warning: Computation failed in `stat_density2d()`:
## missing value where TRUE/FALSE needed
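The plot shows no density for member users with unknown gender: as the warning above indicates, the density computation failed for one panel, most likely because that group has too few routes above the 200-trip threshold. For male users the area is almost the same as in the previous density plot for members. For female users it lies south of 42nd Street and north of Delancey Street.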
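The warning printed above comes from the density step. ggplot2 documents `stat_density_2d()` as using `MASS::kde2d()` for the estimate, and one way to make `kde2d()` fail is to give it a single point, because the default bandwidth cannot be computed from one observation. The sketch below assumes (without confirming it for this dataset) that one facet contained a single route; the coordinates are made up.

```r
library(MASS)  # recommended package shipped with standard R installations

# One hypothetical start point, standing in for a facet with a single route
single_lng <- -73.99
single_lat <- 40.73

# The default bandwidth is NA when there is only one observation
single_bandwidth <- bandwidth.nrd(single_lng)

# kde2d() then stops; capture its error message instead of failing
kde_message <- tryCatch(
  {
    kde2d(single_lng, single_lat)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

single_bandwidth
kde_message
```

Counting rows per gender in `Member200`, for example with `table(Member200$gender)`, would show whether a very small group is the cause here.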
ggmap(nyc_st, darken = c(.5,"white")) +
geom_curve(Casual200, mapping = aes(x = start_lng, y = start_lat,
xend = end_lng, yend = end_lat, alpha= 1, color = gender),
size = .8, curvature = .2,arrow = arrow(length=unit(0.2,"cm"),
ends="first", type = "closed")) + coord_cartesian() +
labs(title = "Routes used more than 200 times by casual users")
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Interestingly, in the above plot we see that the popular routes of casual users with unknown gender include many tourist attractions such as Governors Island (near the Statue of Liberty), Central Park, Battery Park, the tourist areas and ferries along West Street, and the Brooklyn Bridge. Also popular with the same group are Prospect Park, Downtown Brooklyn, the Williamsburg Bridge, the Queensboro Bridge, and Roosevelt Island. Male and female casual users appear mainly on the bike trail along West Street.
Datasets by quarter
# Forward slashes avoid invalid backslash escapes in R strings on Windows
write.csv(q1_2020, "D:/Documents/Google Analytics/Case Study/q1_2020.csv")
write.csv(q2_2020, "D:/Documents/Google Analytics/Case Study/q2_2020.csv")
write.csv(q3_2020, "D:/Documents/Google Analytics/Case Study/q3_2020.csv")
write.csv(q4_2020, "D:/Documents/Google Analytics/Case Study/q4_2020.csv")
The 3.1 GB dataset
write.csv(nyc_bike_trips, "D:/Documents/Google Analytics/Case Study/citibike_2020.csv")